NumPy User Guide: What Is the Performance Gap? Why Extend NumPy?

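NumPy is built in C, yet some compute-heavy algorithms still hit a vectorization wall. This happens when the overhead inherent in Python's dynamic nature outweighs the benefits of the high-level abstraction.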

1. Interpreter cost and boxing

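Every iteration of a standard Python loop performs dynamic type checks and reference counting. Even with NumPy scalars, "boxing" raw C data into Python objects creates a significant bottleneck for functions such as $\text{logit}(p) = \log(p/(1-p))$. Handling the edge cases in C is dramatically faster: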

>>> logit(0) → -inf
>>> logit(1) → inf
>>> logit(2) → nan
>>> logit(-2) → nan
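
As a rough illustration of the four edge cases above, the sketch below compares a pure-Python loop with a single NumPy expression. The function names `logit_loop` and `logit_vectorized` are hypothetical and are not part of NumPy; the vectorized version relies on IEEE 754 floating-point behavior to produce `-inf`, `inf`, and `nan`, with warnings suppressed via `np.errstate`.

```python
import math
import numpy as np

def logit_loop(values):
    """Pure-Python loop: each element is handled as a boxed Python object."""
    out = []
    for p in values:
        if p < 0.0 or p > 1.0:
            out.append(math.nan)       # outside the domain [0, 1]
        elif p == 0.0:
            out.append(-math.inf)
        elif p == 1.0:
            out.append(math.inf)
        else:
            out.append(math.log(p / (1.0 - p)))
    return out

def logit_vectorized(p):
    """One NumPy expression: edge cases fall out of floating-point arithmetic."""
    p = np.asarray(p, dtype=np.float64)
    # Suppress divide-by-zero and invalid-value warnings for the edge cases.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(p / (1.0 - p))

points = [0.0, 1.0, 2.0, -2.0, 0.5]
print(logit_loop(points))
print(logit_vectorized(points))
```

The loop needs explicit branches for every edge case, while the vectorized form needs none. Both still run through Python-level dispatch, which is the overhead a C implementation would avoid.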

2. Intermediate array bloat

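A pure NumPy expression creates a temporary memory buffer for each sub-operation. Extending through the C-API enables kernel fusion, so the logit transform can be computed in a single pass with no auxiliary memory overhead.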
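
To make the temporaries visible, here is a minimal sketch with two hypothetical helpers, `logit_naive` and `logit_one_buffer`. The second reuses one buffer through the ufunc `out=` argument. This is not kernel fusion: it still makes three passes over memory and only avoids the extra allocations. A true single-pass version would need a compiled loop, for example a custom ufunc written against the C-API.

```python
import numpy as np

def logit_naive(p):
    # Allocates three arrays: (1 - p), p / (1 - p), and the log result.
    return np.log(p / (1.0 - p))

def logit_one_buffer(p):
    # Allocates a single array and overwrites it in place at each step.
    out = np.subtract(1.0, p)       # out = 1 - p
    np.divide(p, out, out=out)      # out = p / (1 - p)
    np.log(out, out=out)            # out = log(p / (1 - p))
    return out

p = np.array([0.1, 0.5, 0.9])
print(logit_naive(p))
print(logit_one_buffer(p))
```

Both functions return the same values. The difference is in peak memory use, which matters most when `p` is large relative to the cache.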

3. Spatial dependencies

Operations with neighbor-access patterns, such as a 2D stencil:

$$B(I, J) = A(I, J) + 0.5\,\bigl(A(I-1, J) + A(I+1, J) + A(I, J-1) + A(I, J+1)\bigr) + 0.25\,\bigl(A(I-1, J-1) + A(I-1, J+1) + A(I+1, J-1) + A(I+1, J+1)\bigr)$$

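This is hard to implement efficiently with slicing: the slices themselves are views, but every sum of shifted slices is materialized as a temporary array, so redundant memory copies are unavoidable. A C extension allows direct, cache-aligned pointer arithmetic instead.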
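
For reference, the stencil can be written with slices as in the sketch below. The helper name `stencil_slices` is hypothetical, and leaving the border cells unchanged is an assumption, since the formula does not say how boundaries are treated.

```python
import numpy as np

def stencil_slices(a):
    """Apply the 9-point stencil to the interior of a 2D array."""
    b = a.copy()  # assumption: border cells keep their original values
    b[1:-1, 1:-1] = (
        a[1:-1, 1:-1]
        # Edge neighbors (up, down, left, right), weight 0.5.
        + (a[:-2, 1:-1] + a[2:, 1:-1] + a[1:-1, :-2] + a[1:-1, 2:]) * 0.5
        # Diagonal neighbors, weight 0.25.
        + (a[:-2, :-2] + a[:-2, 2:] + a[2:, :-2] + a[2:, 2:]) * 0.25
    )
    return b

a = np.arange(16.0).reshape(4, 4)
print(stencil_slices(a))
```

Each `+` and `*` in the expression produces a full interior-sized temporary, which is the overhead a hand-written C loop would avoid by reading all nine neighbors in one pass.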

QUESTION 1

What is the primary cause of the 'Interpreter Tax' in Python loops?

Fixed memory allocation for arrays.

Dynamic type-checking and object boxing per iteration.

Lack of support for floating-point math.

Automatic garbage collection of global variables.

QUESTION 2

How does 'Kernel Fusion' improve performance in C-extensions?

By increasing the number of CPU cores used.

By combining multiple operations into a single pass over memory.

By converting all data into 8-bit integers.

By bypassing the C-API entirely.

QUESTION 3

Why are stencil operations problematic for pure NumPy vectorization?

NumPy does not support 2D arrays.

They require redundant memory copies when expressed via slicing.

They cannot be computed using floating-point numbers.

The logit function is required for all stencils.

QUESTION 4

What happens when a computation hits the 'Vectorization Wall'?

The computer runs out of disk space.

Context-switching overhead outweighs the benefits of high-level vectorization.

The GPU takes over the calculation automatically.

NumPy raises a VectorizationError.

QUESTION 5

Handling logit domain errors (like logit(2)) is faster in C because:

Python doesn't know what 'nan' is.

It can be handled at the hardware level by the FPU/SIMD units.

C automatically ignores all errors.

The C-API converts all 'nan' values to zero.